
feat: per-session pipelining via seq-ordered data ops#903

Open
dazzling-no-more wants to merge 1 commit into therealaleph:main from dazzling-no-more:feature/tunnel-session-pipelining

Conversation

@dazzling-no-more
Contributor

Summary

Per-session pipelining for full-tunnel mode: up to 2 in-flight data ops per session overlap one Apps Script round-trip with the next on a single TCP stream, lifting the per-session throughput cap imposed by strict request/await/request sequencing.

Wire protocol (forward-compatible — old peers ignore the new fields):

  • BatchOp.seq: Option<u64> — per-session monotonic sequence number, sent only on data ops in pipelined sessions.
  • TunnelResponse.seq: Option<u64> — echoed by the server.
  • TunnelResponse.caps: Option<u32> — advertised on connect / connect_data success replies. Bit 0 = CAPS_PIPELINE_SEQ.
  • u64 rather than u32 so a long-lived TCP session generating ~100 ops/s doesn't wrap the counter after ~1.4 years (u32::MAX / 100 ops/s ≈ 497 days).
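
A minimal sketch of the wire shape, assuming the field placement described above (BatchOp and TunnelResponse are the real struct names; all pre-existing fields are elided):

```rust
use serde::{Deserialize, Serialize};

/// Capability bit advertised on connect / connect_data success replies.
pub const CAPS_PIPELINE_SEQ: u32 = 1 << 0;

/// One op inside a batch request (pre-existing fields elided).
#[derive(Serialize, Deserialize)]
pub struct BatchOp {
    /// Per-session monotonic counter; set only on data ops in
    /// pipelined sessions, absent on the wire otherwise.
    #[serde(default, skip_serializing_if = "Option::is_none")]
    pub seq: Option<u64>,
}

/// Reply shape (pre-existing fields elided).
#[derive(Serialize, Deserialize)]
pub struct TunnelResponse {
    /// Echoed by the server for data ops that carried seq.
    #[serde(default, skip_serializing_if = "Option::is_none")]
    pub seq: Option<u64>,
    /// Bit 0 = CAPS_PIPELINE_SEQ.
    #[serde(default, skip_serializing_if = "Option::is_none")]
    pub caps: Option<u32>,
}
```

With `default` on deserialize and `skip_serializing_if` on serialize, old wire shapes round-trip byte-identically and old peers simply never see the new fields.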

Tunnel-node (new): when a data op carries seq, the server processes the entire (write, drain) sequence under a per-session seq lock and only releases it after the session's expected seq advances. Two ops can travel through different deployments and arrive out of order; the seq lock guarantees they're processed (and replied to) in send order, so a client reorder buffer never sees bytes from seq=N+1 land before seq=N's bytes. Old data ops without seq continue down the legacy unordered path. connect/connect_data success replies set caps: Some(CAPS_PIPELINE_SEQ); old clients ignore the unknown field, new clients opt into pipelining.
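
A simplified sketch of the ordering gate, under stated assumptions: SeqGate is a hypothetical name, and where the PR's actual wait_for_seq_turn is Notify-based with a missed-notify race fix, this sketch uses a tokio watch channel, which avoids that race by construction (receivers track a version counter instead of wake-up edges). SeqAdvanceOnDrop is taken from the test plan below.

```rust
use tokio::sync::watch;

/// Hypothetical per-session gate: each seq op waits for its turn, and
/// `expected` only advances after the whole (write, drain) pair is done.
pub struct SeqGate {
    expected: watch::Sender<u64>,
}

impl SeqGate {
    pub fn new() -> Self {
        let (tx, _rx) = watch::channel(0);
        Self { expected: tx }
    }

    /// Parks until it is `seq`'s turn; returns immediately if it already is.
    pub async fn wait_for_seq_turn(&self, seq: u64) {
        let mut rx = self.expected.subscribe();
        let _ = rx.wait_for(|e| *e == seq).await;
    }

    /// Called once the op's (write, drain) sequence has fully completed.
    pub fn advance(&self) {
        self.expected.send_modify(|e| *e += 1);
    }
}

/// Guard mirroring the SeqAdvanceOnDrop named in the test plan: if the
/// write path errors out early, dropping the guard still bumps `expected`
/// so later seqs on the session are not wedged behind a dead op.
pub struct SeqAdvanceOnDrop<'a>(pub &'a SeqGate);

impl Drop for SeqAdvanceOnDrop<'_> {
    fn drop(&mut self) {
        self.0.advance();
    }
}
```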

Client (new): tunnel_loop_pipelined keeps up to PIPELINE_DEPTH = 2 data ops in flight per pipelined session. Reordering is implicit: a VecDeque<(seq, oneshot::Receiver)> awaited in FIFO order means replies for later seqs sit in their oneshot buffers until our task gets to them. Idle sessions stay at depth=1 (the server long-polls; no speculative quota burn). Pipelining is enabled only when the connect reply advertised caps AND no prior reply has been observed dropping the seq field (mixed-version backend protection).
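
A sketch of the implicit-reorder drain under the same assumptions (names other than PIPELINE_DEPTH are illustrative):

```rust
use std::collections::VecDeque;
use tokio::sync::oneshot;

const PIPELINE_DEPTH: usize = 2;

/// Receivers are awaited strictly in send order, so a reply that arrives
/// early for a later seq just sits in its oneshot channel until its turn.
async fn drain_in_order(
    in_flight: &mut VecDeque<(u64, oneshot::Receiver<Vec<u8>>)>,
) -> Result<(), &'static str> {
    while let Some((seq, rx)) = in_flight.pop_front() {
        let bytes = rx.await.map_err(|_| "reply dropped")?;
        // Write to the local TCP socket here; because we popped in FIFO
        // order, seq N's bytes can never land after seq N+1's.
        let _ = (seq, bytes);
    }
    Ok(())
}

// Issue side (conceptually): only send a new data op while
// in_flight.len() < PIPELINE_DEPTH; an idle session has nothing queued,
// so it naturally sits at depth 1 with the server long-polling.
```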

Cross-cutting correctness:

  • Per-batch BatchWait primitive replaces N×M per-job watchers with one watcher per unique SessionInner (deduplicated by Arc::as_ptr). reader_task switched to notify_waiters() so concurrent batches all wake on each push; see the sketch after this list.
  • pipelined_reply_timeout derived as effective_batch * 2 + slack — covers permit-saturated semaphore wait plus the op's own request budget.
  • Active seq ops (had_uplink == true) wait per-session up to ACTIVE_DRAIN_DEADLINE and run a straggler-settle loop, so multi-packet upstream responses land in one drain instead of being split across batches. Idle ops still subscribe to the shared BatchWait for cross-session wake.
  • close ops collected during dispatch and processed AFTER seq jobs ONLY when an earlier same-sid op is already deferred (else run inline) — preserves both [data(seq), close] (the data is written before close tears the session down) AND [close, data] (close runs first, data hits the closed session).
  • client_send_closed flag in tunnel_loop_pipelined so a TCP half-close from the local client stops queuing new ops but keeps draining already-queued reply oneshots — valid HTTP request/response with client-side shutdown no longer drops downlink bytes.
  • Effective batch timeout threaded through tunnel_batch_request_with_timeout (3-arg tunnel_batch_request_to kept as backward-compat wrapper) so both h2 (h2_relay_request) and h1 (read_http_response_with_timeout) honor the pipelined-batch floor — avoids inner 30 s default firing before our outer 60 s budget on user-tuned configs.
  • Phase 2's TCP-drain budget reservation capped at TCP_DRAIN_MAX_BYTES per iteration (mirrors the seq path), so concurrent seq jobs continue to see headroom while Phase 2's drain is in flight.
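
A sketch of the BatchWait dedup-and-wake shape referenced in the first bullet (SessionInner, the Arc::as_ptr dedup, and notify_waiters follow the description above; function names are illustrative):

```rust
use std::collections::HashSet;
use std::sync::Arc;
use tokio::sync::Notify;

/// Stand-in for the per-session state (the real struct carries buffers etc.).
struct SessionInner {
    push_notify: Notify,
}

/// One watcher per *unique* session: jobs in a batch that share a
/// SessionInner are deduplicated by pointer identity, replacing the old
/// N jobs x M sessions watcher fan-out.
fn unique_sessions(jobs: &[Arc<SessionInner>]) -> Vec<Arc<SessionInner>> {
    let mut seen = HashSet::new();
    jobs.iter()
        .filter(|s| seen.insert(Arc::as_ptr(s)))
        .cloned()
        .collect()
}

/// reader_task side: notify_waiters() wakes every currently-registered
/// watcher, so concurrent batches parked on the same session all observe
/// each push (notify_one would wake only one of them).
fn on_push(session: &SessionInner) {
    session.push_notify.notify_waiters();
}
```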

Public API change

TunnelMux::udp_open and udp_data already take impl Into<Bytes> (carried over from the prior zero-copy PR). tunnel_batch_request_to keeps its 3-arg signature; new tunnel_batch_request_with_timeout exposes the explicit timeout for callers that need it. Public structs TunnelResponse and BatchOp gain new optional fields with #[serde(default)] / #[serde(skip_serializing_if = "Option::is_none")] — old wire shapes deserialize cleanly and serialize identically.
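
A sketch of how the compat wrapper and the timeout derivation could fit together; only the two function names, the 3-arg/4-arg relationship, the 30 s inner default, and the effective_batch * 2 + slack formula come from the PR text, and the argument lists are hypothetical:

```rust
use std::time::Duration;

/// Derivation named above: one full batch for a permit-saturated
/// semaphore wait, one for the op's own request budget, plus slack.
fn pipelined_reply_timeout(effective_batch: Duration, slack: Duration) -> Duration {
    effective_batch * 2 + slack
}

async fn tunnel_batch_request_with_timeout(
    endpoint: &str,
    body: Vec<u8>,
    retries: u32,
    timeout: Duration,
) -> std::io::Result<Vec<u8>> {
    // The h2 / h1 transport would honor `timeout` here; elided in this sketch.
    let _ = (endpoint, body, retries, timeout);
    Ok(Vec::new())
}

/// Backward-compat wrapper: keeps its original 3-arg signature and
/// delegates with the old default, so existing callers are untouched.
async fn tunnel_batch_request_to(
    endpoint: &str,
    body: Vec<u8>,
    retries: u32,
) -> std::io::Result<Vec<u8>> {
    tunnel_batch_request_with_timeout(endpoint, body, retries, Duration::from_secs(30)).await
}
```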

Realistic perf delta

For active streaming downloads on a single TCP session, expect 30–80% throughput improvement depending on what fraction of the Apps Script RTT is server-side vs network. Less for idle / long-poll workloads (server-side seq enforcement serializes drains per session). No win on multi-session workloads (those were already parallel across sessions). Documented latency trade-off: a stuck seq op (lost earlier seq from a dropped batch) holds the batch HTTP response open for up to SEQ_WAIT_TIMEOUT = 30 s; the pinned regression test unrelated_seq_session_in_same_batch_is_not_delayed_past_seq_wait ensures unrelated sessions' intrinsic processing time stays sub-second within that window.

Test plan

  • cargo build --bins --lib clean both crates
  • cargo test --lib (client) + cargo test (tunnel-node) pass — 216 + 57 = 273 total
  • Server-side correctness: seq ordering across out-of-order arrivals, missed-notify race in wait_for_seq_turn, write-failure path bumps expected (SeqAdvanceOnDrop), tail preservation under cap, batch budget shared with seq jobs, EOF withholding while buffer non-empty, [data(seq), close] runs data before close, [close, data] runs close first; a toy illustration of the first property follows this list
  • Server-side concurrency: BatchWait wakes all jobs on single push, dedupes watchers per unique inner, wakes across concurrent batches on the same session, mixed ready+idle batch latency, Phase 2 budget reservation capped at TCP_DRAIN_MAX_BYTES
  • Client-side: pipelined loop emits seq=0 first, in-seq-order output when replies arrive reversed, closes on seq mismatch, closes on seq=None (mixed-version detection), drains in-flight on client half-close, pipelined_reply_timeout covers 2× batch_timeout, mark_pipelining_disabled sticky and overrides caps
  • Manual: load a heavy site through full-tunnel mode and confirm no regressions on legacy non-pipelined sessions (mixed connect_data capability detection)
  • Manual: pipelined-capable session under heavy concurrent load — confirm no avoidable disconnects under semaphore saturation
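
As a toy illustration of the first server-side property, here is a test against the SeqGate sketch in the Summary above (not the PR's actual test): an op that arrives out of order cannot complete until every earlier seq has advanced the gate.

```rust
use std::sync::Arc;

#[tokio::test]
async fn out_of_order_arrival_waits_its_turn() {
    let gate = Arc::new(SeqGate::new());

    let g = gate.clone();
    let later = tokio::spawn(async move {
        // seq=1 "arrives" first via a different deployment.
        g.wait_for_seq_turn(1).await;
    });

    tokio::task::yield_now().await;
    assert!(!later.is_finished()); // still parked behind seq=0

    gate.wait_for_seq_turn(0).await; // immediate: it is seq 0's turn
    gate.advance();                  // (write, drain) done for seq=0

    later.await.unwrap(); // seq=1 now proceeds
}
```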

github-actions bot added the type: feature label (auto-applied by release-drafter) May 8, 2026
@therealaleph
Owner

Reviewed via Anthropic Claude.

Verified locally on top of v1.9.18:

  • cargo build --bins --lib --release: clean
  • cargo test --lib --release: 216/216 ✅
  • (cd tunnel-node && cargo test --release): 57/57 ✅

Wire-protocol design looks careful — #[serde(default)] + #[serde(skip_serializing_if = "Option::is_none")] on the new optional fields means old peers ignore the new fields and old shapes deserialize cleanly. The opt-in cap-bit (CAPS_PIPELINE_SEQ on connect-success) + sticky mark_pipelining_disabled on first seq=None reply is the right shape for mixed-deployment fleets where one Apps Script is running old Code.gs.

Hesitating on immediate merge because of two factors:

  1. Surface area: 3620 additions across the three highest-traffic files in the project (tunnel_client, domain_fronter, tunnel-node). Even with the test count this is the kind of PR where edge-case races only show up under real Iran-ISP latency + h2 connection drops + concurrent sessions on a busy VPS.
  2. Manual checkboxes still unchecked in the test plan: heavy load against a non-pipelined backend (mixed-version), and pipelined session under semaphore-saturation.

Plan: leaving open for 5–7 days of community testing like we did with #359 (drive queue). If two or three users on different ISPs report it stable for streaming downloads + heavy concurrent browsing, I'll squash-merge as v1.9.19.

For testers reading this: pull feature/tunnel-session-pipelining, run cargo build --release --features ui --bin mhrv-rs-ui, switch to Full mode, do a heavy download (Steam / large CDN file) + a few hours of normal browsing, and report back here with:

  • ISP / country
  • mhrv-rs version your tunnel-node deployment is running (so we know if you're testing pure-pipelined or mixed-version)
  • any disconnects, stalls, or out-of-order data the browser noticed (e.g. partial JS bundles, mid-stream HTTPS errors)

Thanks @dazzling-no-more — this is the kind of structural perf work that's hard to land safely. The seq-lock + capability-negotiation design is exactly the right architecture for what is ultimately a not-quite-ordered transport layer.
